Abstract

The  success  of  reinforcement  learning  in  practical  problems  depends  on  the  ability  to  combine  function 
approximation  with  temporal  difference  methods  such  as  value  iteration.  Experiments  in  this  area  have 
produced  mixed  results;  there  have  been  both  notable  successes  and  notable  disappointments.  Theory  has 
been  scarce,  mostly  due  to  the  difficulty  of  reasoning  about  function  approximators  that  generalize  beyond  the 
observed  data.  We  provide  a  proof  of  convergence  for  a  wide  class  of  temporal  difference  methods  involving 
function approximators such as k-nearest-neighbor, and show experimentally that these methods can be
useful.  The  proof  is  based  on  a  view  of  function  approximators  as  expansion  or  contraction  mappings. 
In  addition,  we  present  a  novel  view  of  approximate  value  iteration:  an  approximate  algorithm  for  one 
environment  turns  out  to  be  an  exact  algorithm  for  a  different  environment. 
1  Introduction  and  background

The  problem  of  temporal  credit  assignment  —  deciding  which  of  a  series  of  actions  is  responsible  for  a 
delayed  reward  or  penalty  —  is  an  integral  part  of  machine  learning.  The  methods  of  temporal  differences 
are  one  approach  to  this  problem.  In  order  to  learn  how  to  select  actions,  they  learn  how  easily  the  agent 
can  achieve  a  reward  from  various  states  of  its  environment.  Then,  they  weigh  the  immediate  rewards  for 
an  action  against  its  long-term  consequences  —  a  small  immediate  reward  may  be  better  than  a  large  one, 
if  the  small  reward  allows  the  agent  to  reach  a  high-payoff  state.  If  a  temporal  difference  method  discovers 
a  low-cost  path  from  one  state  to  another,  it  will  remember  that  the  first  state  can’t  be  much  harder  to 
get  rewards  from  than  the  second;  in  this  way,  information  propagates  backwards  from  states  in  which  an 
immediate  reward  is  possible  to  those  from  which  the  agent  can  only  achieve  a  delayed  reward. 

One  of  the  first  examples  of  a  temporal  difference  method  was  the  Bellman-Ford  single-destination 
shortest  paths  algorithm  [Bel58,  FF62],  which  learns  paths  through  a  graph  by  repeatedly  updating  the 
estimated distance-to-goal for each node based on the distances for its neighbors. At around the same
time,  research  on  optimal  control  led  to  the  solution  of  Markov  processes  and  Markov  decision  processes  (see 
below)  by  temporal  difference  methods  [Bel61,  Bla65].  More  recently  [Wit77,  Sut88,  Wat89],  researchers  have 
attacked  the  problem  of  solving  an  unknown  Markov  process  or  Markov  decision  process  by  experimenting 
with  it. 

Many of the above methods have proofs of convergence [BT89, WD92, Day92, JJS94, Tsi94]. Unfortunately,
most of these proofs assume that we represent our solution exactly and therefore expensively, so that
solving a Markov decision problem with n states requires O(n) storage. On the other hand, it is perfectly
possible  to  perform  temporal  differencing  on  an  approximate  representation  of  the  solution  to  a  decision 
problem: Bellman discusses quantization and low-order polynomial interpolation in [Bel61], and approximation
by orthogonal functions in [BD59, BKK63]. These approximate temporal difference methods are
not  covered  by  the  above  convergence  proofs.  But,  if  they  do  converge,  they  can  allow  us  to  find  numerical 
solutions  to  problems  which  would  otherwise  be  too  large  to  solve. 

Researchers  have  experimented  with  a  number  of  approximate  temporal  difference  methods.  Results 
have been mixed; there have been notable successes, including Samuel’s checkers player [Sam59], Tesauro’s
backgammon player [Tes90], and Lin’s robot navigation [Lin92]. But these algorithms are notoriously unstable;
Boyan and Moore [BM95] list several embarrassingly simple situations where popular approximate
algorithms fail miserably. Some possible reasons for these failures are given in [TS93, Sab93].

Several  researchers  have  recently  conjectured  that  some  classes  of  function  approximators  work  more 
reliably  with  temporal  difference  methods  than  others.  For  example,  Sutton  [SS94]  has  provided  experimental 
evidence  that  linear  functions  of  coarse  codes,  such  as  CMACs,  can  converge  reliably  for  some  problems. 

He  has  also  suggested  that  online  exploration  of  a  Markov  decision  process  can  help  to  concentrate  the 
representational  power  of  a  function  approximator  in  the  important  regions  of  the  state  space. 

We  will  prove  convergence  for  a  significant  class  of  approximate  temporal  difference  algorithms,  including 
algorithms based on k-nearest-neighbor, linear interpolation, some types of splines, and local weighted
averaging. These algorithms will converge when applied either to discounted decision processes or to an
important subset of nondiscounted decision processes. We will give sufficient conditions for convergence
to  the  exact  value  function,  and  for  discounted  processes  we  will  bound  the  maximum  error  between  the 
estimated  and  true  value  functions. 

2  Definitions  and  basic  theorems 

Our  theorems  in  the  following  sections  will  be  based  on  two  views  of  function  approximators.  First,  we  will 
cast  function  approximators  as  expansion  or  contraction  mappings;  this  distinction  captures  the  essential 
difference  between  approximators  that  can  exaggerate  changes  in  their  training  values,  like  linear  regression 
and neural nets, and those like k-nearest-neighbor that respond conservatively to changes in their inputs.
Second, we will show that approximate temporal difference learning with some function approximators is
equivalent to exact temporal difference learning for a slightly different problem. To aid the statement of
these theorems, we will need several definitions.
Definition: Consider a vector space S and a norm $\|\cdot\|$ on S. If S is closed under limits in $\|\cdot\|$, then S is
a complete vector space. Unless otherwise noted, all vector spaces will be assumed to be complete.

Examples of complete vector spaces include the real numbers under absolute value and the n-vectors of
real numbers under Manhattan ($L_1$), Euclidean ($L_2$), and max ($L_\infty$) norms. The max norm, defined as

$$\|a\|_\infty = \max_i |a_i|$$

and the weighted max norm with weight vector W, defined as

$$\|a\|_W = \max_i \frac{|a_i|}{W_i}$$

are particularly important for reasoning about Markov decision problems.
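As a small illustration of the two norms just defined, the following sketch computes them for a plain Python list. The function names `max_norm` and `weighted_max_norm` are chosen here for illustration and do not come from the paper; the weights are assumed to be strictly positive.

```python
def max_norm(a):
    """Max norm: the largest absolute entry of the vector a."""
    return max(abs(x) for x in a)


def weighted_max_norm(a, w):
    """Weighted max norm: the largest |a_i| / w_i.

    Assumes every weight w_i is strictly positive (an assumption of this sketch).
    """
    if len(a) != len(w):
        raise ValueError("a and w must have the same length")
    if any(wi <= 0 for wi in w):
        raise ValueError("weights must be strictly positive")
    return max(abs(x) / wi for x, wi in zip(a, w))


a = [1.0, -4.0, 2.0]
unweighted = max_norm(a)                         # dominated by the entry -4.0
weighted = weighted_max_norm(a, [1.0, 8.0, 1.0])  # the large weight damps that entry
```

With the weight 8 on the second coordinate, the entry that dominates the unweighted norm no longer dominates the weighted one, which is why a mapping can be a nonexpansion in one of these norms without being a nonexpansion in the other.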

Definition: A function f from a vector space S to itself is a contraction mapping if, for all points a and b
in S, $\|f(a) - f(b)\| \le \gamma \|a - b\|$. Here $\gamma$, the contraction factor or modulus, is any real number in [0, 1). If
we merely have $\|f(a) - f(b)\| \le \|a - b\|$, we call f a nonexpansion.

For example, the function $f(x) = 5 + x/2$ is a contraction with contraction factor 1/2. The identity function
is  a  nonexpansion.  All  contractions  are  nonexpansions. 

Lemma 2.1 Let C and D be contractions with contraction factors $\gamma$ and $\delta$ on some vector space S. Let M
and N be nonexpansions on S. Then

•  $C \circ N$ and $N \circ C$ are each contractions with contraction factor $\gamma$

•  $C \circ D$ is a contraction with factor $\gamma\delta$

•  $M \circ N$ is a nonexpansion.
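The lemma follows by chaining the defining inequalities. A brief sketch of the argument, using only the definitions above, for arbitrary points a and b in S:

```latex
\begin{align*}
\|C(N(a)) - C(N(b))\| &\le \gamma \,\|N(a) - N(b)\| \le \gamma \,\|a - b\|, \\
\|N(C(a)) - N(C(b))\| &\le \|C(a) - C(b)\| \le \gamma \,\|a - b\|, \\
\|C(D(a)) - C(D(b))\| &\le \gamma \,\|D(a) - D(b)\| \le \gamma\delta \,\|a - b\|, \\
\|M(N(a)) - M(N(b))\| &\le \|N(a) - N(b)\| \le \|a - b\|.
\end{align*}
```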

Definition: A point x is a fixed point of the function f if f(x) = x.

A function may have any number of fixed points. For example, the function $x^2$ on the real line has two
fixed  points,  0  and  1;  any  number  is  a  fixed  point  of  the  identity  function;  and  x  +  1  has  no  fixed  points. 

Theorem 2.1 (Contraction Mapping) Let S be a vector space with norm $\|\cdot\|$. Suppose f is a contraction
mapping on S with contraction factor $\gamma$. Then f has exactly one fixed point $x^*$ in S. For any initial point
$x_0$ in S, the sequence $x_0, f(x_0), f(f(x_0)), \ldots$ converges to $x^*$; the rate of convergence of the above sequence
in the norm $\|\cdot\|$ is at least $\gamma$.

For example, the function $5 + x/2$ has exactly one fixed point on the real line, namely x = 10.
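The iteration in the contraction mapping theorem can be tried directly on this example. The sketch below is illustrative only: the helper name `iterate_to_fixed_point`, the tolerance, and the iteration cap are assumptions made here, not part of the paper.

```python
def iterate_to_fixed_point(f, x0, tol=1e-12, max_iter=1000):
    """Apply f repeatedly, starting from x0, until successive iterates agree to within tol."""
    x = x0
    for _ in range(max_iter):
        x_next = f(x)
        if abs(x_next - x) <= tol:
            return x_next
        x = x_next
    raise RuntimeError("iteration did not converge within max_iter steps")


def f(x):
    """The example contraction from the text, with contraction factor 1/2."""
    return 5 + x / 2


x_star = iterate_to_fixed_point(f, x0=0.0)

# Each application of f halves the distance to the fixed point x = 10.
errors = [abs(x - 10) for x in (0.0, f(0.0), f(f(0.0)))]
```

Starting from 0, the distances to the fixed point are 10, 5, 2.5, and so on, matching the guaranteed rate of 1/2 per step.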

If S is a finite-dimensional vector space, then convergence in any norm implies convergence in all norms.

Definition: A Markov decision process is a tuple $(S, A, \delta, c, \gamma, S_0)$. MDPs are a formalism for describing
the experiences of an agent interacting with its environment. The set S is the state space; A is the action
space. At any given time t, the environment is in some state $x_t \in S$. The agent perceives $x_t$, and is allowed
to choose an action $a_t \in A$. The transition function, $\delta$ (which may be probabilistic), then acts on $x_t$ and $a_t$
to produce a next state $x_{t+1}$, and the process repeats. $S_0$ is a distribution on S which gives the probability
of being in each state at time 0. The cost function, c (which may be probabilistic), measures how well the
agent is doing: at each time step t, the agent incurs a cost $c(x_t, a_t)$. The agent must act to minimize the
expected discounted cost $E\left[\sum_{t=0}^{\infty} \gamma^t c(x_t, a_t)\right]$; $\gamma \in [0, 1]$ is called the discount factor. We will write $V^*(x)$
for the optimal value function, the minimal possible expected discounted cost starting from state x. We
introduce conditions below under which $V^*$ is unique and well-defined.

We will say that an MDP is deterministic if the functions c(x, a) and $\delta(x, a)$ are deterministic for all x
and a, i.e., if the current state and action uniquely determine the cost and the next state. An MDP is finite
if its state and action spaces are finite; it is discounted if $\gamma < 1$. We will call an MDP a Markov process
if |A| = 1. In a Markov process, we cannot influence the expected discounted cost; our goal is merely to
compute it.

There are several ways that we can ensure that $V^*$ exists. In a finite discounted MDP, it is sufficient to
require that the cost function c(x, a) have bounded mean and variance for all x and a. For a nondiscounted
MDP, even if c is bounded, cycles may cause the expected total cost for some states to be infinite. So, we
are usually interested in the case where some set of states G is absorbing and cost-free: that is, if we are
in G at time t, we will be in G at time t + 1, and c(x, a) = 0 for any $x \in G$ and $a \in A$. Without loss of
generality, we may lump all such states together and replace them by a single state. Suppose that state 1 of
an MDP is absorbing. Call an action selection strategy proper if, no matter what state we start in, following
the strategy ensures that $P(x_t = 1) \to 1$ as $t \to \infty$. A finite nondiscounted MDP will have a well-defined
optimal value function as long as

•  the  cost  function  has  bounded  mean  and  variance, 

•  there  exists  a  proper  strategy,  and 

•  there does not exist a strategy which has expected cost equal to $-\infty$ from some initial state.

From now on, we will assume that all MDPs that we consider have a well-defined $V^*$. We will also assume
that $S_0$ puts a nonzero probability on every state in S. This allows us to avoid worrying about inaccessible
states.

If we have two MDPs $M_1 = (S, A_1, \delta_1, c_1, \gamma_1, S_0)$ and $M_2 = (S, A_2, \delta_2, c_2, \gamma_2, S_0)$ which share the same
state space, we can define a new MDP $M_{12}$, the composition of $M_1$ and $M_2$, by alternately following
transitions from $M_1$ and $M_2$. More formally, let $M_{12} = (S, A_1 \times A_2, \delta_{12}, c_{12}, \gamma_1 \gamma_2, S_0)$. At each step,
the agent will select one action from $A_1$ and one from $A_2$; we define the composite transition function
$\delta_{12}$ so that $\delta_{12}(x, (a_1, a_2)) = \delta_2(\delta_1(x, a_1), a_2)$. The cost of the composite action will be
$c_{12}(x, (a_1, a_2)) = c_1(x, a_1) + \gamma_1 c_2(\delta_1(x, a_1), a_2)$.
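The composition can be written out concretely for the deterministic case. The sketch below is a minimal illustration under stated assumptions: the `DetMDP` container, the `compose` helper, and the two toy processes are all hypothetical names and numbers introduced here, and the shared state space and initial distribution are left implicit.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DetMDP:
    """A deterministic MDP reduced to its transition function, cost function, and discount."""
    delta: Callable  # delta(x, a) -> next state
    cost: Callable   # cost(x, a) -> immediate cost
    gamma: float


def compose(m1, m2):
    """Compose two deterministic MDPs over the same state space.

    A composite action is a pair (a1, a2): follow m1 with a1, then m2 with a2.
    The cost of the second step is discounted by m1's discount factor.
    """
    def delta12(x, a):
        a1, a2 = a
        return m2.delta(m1.delta(x, a1), a2)

    def cost12(x, a):
        a1, a2 = a
        return m1.cost(x, a1) + m1.gamma * m2.cost(m1.delta(x, a1), a2)

    return DetMDP(delta12, cost12, m1.gamma * m2.gamma)


# Two toy processes on the integers (illustrative only).
m1 = DetMDP(delta=lambda x, a: x + a, cost=lambda x, a: 1.0, gamma=0.9)
m2 = DetMDP(delta=lambda x, a: x - a, cost=lambda x, a: float(abs(x)), gamma=0.5)
m12 = compose(m1, m2)

next_state = m12.delta(3, (2, 1))   # (3 + 2) - 1
step_cost = m12.cost(3, (2, 1))     # 1 + 0.9 * |3 + 2|
```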

A trajectory is a sequence of tuples $(x_0, a_0, c_0), (x_1, a_1, c_1), \ldots$; trajectories describe the experiences of an
agent acting in an MDP. If the MDP is absorbing, there will be a point t so that $c_t = c_{t+1} = \ldots = 0$; we
will usually omit the portion of the trajectory after t.

Define a policy to be a function $\pi : S \to A$. An agent may follow policy $\pi$ by choosing action $\pi(x)$
whenever it is in state x. It is possible to generalize the above definition to include randomized strategies
and strategies which change over time; but the extra generality is unnecessary. It is well-known [Bel61, BT89]
that every Markov decision process with a well-defined $V^*$ has at least one optimal policy $\pi^*$; an agent which
follows $\pi^*$ will do at least as well as any other agent, including agents which choose actions according to
non-policies. The policy $\pi^*$ will satisfy Bellman’s equation

$$(\forall x \in S)\quad V^*(x) = E\big(c(x, \pi^*(x)) + \gamma V^*(\delta(x, \pi^*(x)))\big)$$

and every policy which satisfies Bellman’s equation is optimal.

There  are  two  broad  classes  of  learning  problems  for  Markov  decision  processes:  online  and  offline.  In 
both  cases,  we  wish  to  compute  an  optimal  policy  for  some  MDP.  In  the  offline  case,  we  are  allowed  access 
to  the  whole  MDP,  including  the  cost  and  transition  functions;  in  the  online  case,  we  are  only  given  S 
and  A,  and  then  must  discover  what  we  can  about  the  MDP  by  interacting  with  it.  (In  particular,  in  the 
online  case,  we  are  not  free  to  try  an  action  from  any  state;  we  are  limited  to  acting  in  the  current  state.) 
We  can  transform  an  online  problem  into  an  offline  one  by  observing  one  or  more  trajectories,  estimating 
the  cost  and  transition  functions,  and  then  pretending  that  our  estimates  are  the  truth.  (This  approach  is 
called  the  certainty  equivalent  method.)  Similarly,  we  can  transform  an  offline  problem  into  an  online  one 
by  pretending  that  we  don’t  know  the  cost  and  transition  functions.  Most  of  the  remainder  of  the  paper 
deals  with  offline  problems,  but  we  mention  online  problems  again  in  section  5. 

In the offline case, the optimal value function tells us the optimal policies; we may set $\pi^*(x)$ to be any a
which minimizes $E(c(x, a) + \gamma V^*(\delta(x, a)))$. (In the online case, $V^*$ is not sufficient, since we can’t compute
the above expectation.) For a finite MDP, we can find $V^*$ by dynamic programming. With appropriate
assumptions, repeated application of the dynamic programming backup operator

$$V(x) \leftarrow \min_{a \in A} E\big(c(x, a) + \gamma V(\delta(x, a))\big)$$

to every state is guaranteed to converge to $V^*$ from any initial guess [BT89]. (In the case of a nondiscounted
problem with cost-free absorbing state g, we define the backup operator to set $V(g) \leftarrow 0$ as a special case.)
This  dynamic  programming  algorithm  is  called  value  iteration.  If  we  need  to  solve  an  infinite  MDP  that 
satisfies  certain  continuity  conditions,  we  may  first  approximate  it  by  a  finite  MDP  as  described  in  [CT89], 
then  solve  the  finite  MDP  by  value  iteration.  For  this  reason,  the  remainder  of  the  paper  will  focus  on  finite 
(although  possibly  very  large)  Markov  decision  processes. 

We  can  generalize  the  above  single-state  version  of  the  value  iteration  backup  operator  to  allow  parallel 
updating:  instead  of  merely  changing  our  estimate  for  one  state  at  a  time,  we  compute  the  new  value  for 
every  state  before  altering  any  of  the  estimates.  The  result  of  this  change  is  the  parallel  value  iteration 
operator.  The  following  two  theorems  imply  the  convergence  of  parallel  value  iteration.  See  [BT89]  for 
proofs. 
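A minimal sketch of parallel value iteration for a small finite discounted MDP follows. The data layout (`P[x][a]` as a list of `(probability, next_state)` pairs and `c[x][a]` as an expected cost), the stopping rule, and the two-state example are assumptions made here for illustration; they are not taken from the paper.

```python
def backup(V, P, c, gamma):
    """One parallel backup: every new value is computed from the old estimate V."""
    return [
        min(
            c[x][a] + gamma * sum(p * V[y] for p, y in P[x][a])
            for a in range(len(P[x]))
        )
        for x in range(len(V))
    ]


def value_iteration(P, c, gamma, tol=1e-10, max_iter=100000):
    """Repeat the parallel backup until the max-norm change falls below tol."""
    V = [0.0] * len(P)
    for _ in range(max_iter):
        V_new = backup(V, P, c, gamma)
        if max(abs(u - v) for u, v in zip(V_new, V)) <= tol:
            return V_new
        V = V_new
    raise RuntimeError("value iteration did not converge within max_iter sweeps")


# Hypothetical two-state example. In state 0, action 0 stays put at cost 1,
# and action 1 moves to the cost-free absorbing state 1 at cost 2.
P = [
    [[(1.0, 0)], [(1.0, 1)]],
    [[(1.0, 1)], [(1.0, 1)]],
]
c = [
    [1.0, 2.0],
    [0.0, 0.0],
]
V_star = value_iteration(P, c, gamma=0.9)
```

In this example staying in state 0 forever would cost 1/(1 - 0.9) = 10, so paying 2 once to reach the absorbing state is optimal and the iteration settles on a value of 2 for state 0.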

Theorem 2.2 (value contraction) The parallel value iteration operator for a discounted Markov decision
process  is  a  contraction  in  max  norm,  with  contraction  factor  equal  to  the  discount.  If  all  policies  in  a 
nondiscounted  Markov  decision  process  are  proper,  then  the  parallel  value  iteration  operator  for  that  process 
is  a  contraction  in  some  weighted  max  norm.  The  fixed  point  of  each  of  these  operators  is  the  optimal  value 
function  for  the  MDP. 

Theorem 2.3 Let a nondiscounted Markov decision process have at least one proper policy, and let all
improper policies have expected cost equal to $+\infty$ for at least one initial state. Then the parallel value
iteration  operator  for  that  process  converges  from  any  initial  guess  to  the  optimal  value  function  for  that 
process. 


3  Main  results:  a  simple  case 

In  this  section,  we  will  consider  only  discounted  Markov  decision  processes.  The  following  sections  generalize 
the  results  to  other  interesting  cases. 

Suppose  that  T  is  the  parallel  value  backup  operator  for  a  Markov  decision  process  M.  In  the  basic 
value iteration algorithm, we start off by setting $V_0$ to some initial guess at M’s value function. Then we
repeatedly set $V_{i+1}$ to be $T(V_i)$ until we either run out of time or decide that some $V_i$ is a sufficiently
accurate approximation to $V^*$. Normally we would represent each $V_i$ as an array of real numbers indexed
by  the  states  of  M;  this  data  structure  allows  us  to  represent  any  possible  value  function  exactly. 

Now suppose that we wish to represent $V_i$, not by a lookup table, but by some other more compact data
structure such as a neural net. We immediately run into two difficulties. First, computing $T(V_i)$ generally
requires that we examine $V_i(x)$ for nearly every x in M’s state space; and if M has enough states that we
can’t afford a lookup table, we probably can’t afford to compute $V_i$ that many times either. Second, even if
we can represent $V_i$ exactly with a neural net, there is no guarantee that we can also represent $T(V_i)$.

To address both of these difficulties, we will assume that we have a sample $X_0$ of states from M. $X_0$
should be small enough that we can examine each element repeatedly; but it should be large enough and
representative enough that we can learn something about M by examining only the states in $X_0$. Now we can
define an approximate value iteration algorithm. Rather than setting $V_{i+1}$ to $T(V_i)$, we will first compute
$(T(V_i))(x)$ only for $x \in X_0$; then we will fit our neural net (or other approximator) to these training values
and call the resulting function $V_{i+1}$.
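The loop just described can be sketched as follows. Everything concrete in this sketch is an assumption made for illustration: the ten-state line world, the discount of 0.9, the sample $X_0$ = {0, 2, 4, 6, 8}, and the use of a 1-nearest-neighbor fit (with ties broken toward the earlier sample point) in place of a neural net.

```python
def nearest_neighbor_fit(xs, ys):
    """Return a function predicting the target of the nearest sample point.

    Ties are broken toward the earlier sample point (an arbitrary choice).
    """
    pairs = list(zip(xs, ys))

    def fitted(x):
        return min(pairs, key=lambda p: abs(p[0] - x))[1]

    return fitted


def approximate_value_iteration(sample, backup_at, fit, n_iter):
    """Alternate backups on the sample with fitting, starting from V_0 = 0."""
    value = lambda x: 0.0
    for _ in range(n_iter):
        targets = [backup_at(value, x) for x in sample]  # (T(V_i))(x) for x in X_0 only
        value = fit(sample, targets)                     # fitted function becomes V_{i+1}
    return value


N_STATES, GAMMA = 10, 0.9


def step(x, a):
    """Deterministic move left or right on states 0..9; state 0 is absorbing."""
    return 0 if x == 0 else min(max(x + a, 0), N_STATES - 1)


def cost(x, a):
    """Unit cost per move, except in the cost-free absorbing state 0."""
    return 0.0 if x == 0 else 1.0


def backup_at(value, x):
    """The value backup at a single state x, using the current estimate."""
    return min(cost(x, a) + GAMMA * value(step(x, a)) for a in (-1, +1))


X0 = [0, 2, 4, 6, 8]
V = approximate_value_iteration(X0, backup_at, nearest_neighbor_fit, n_iter=50)
fitted_values = [V(x) for x in range(N_STATES)]
```

The iteration settles down, but to an approximation rather than to the exact value function: the fitted value at state 2 comes out as 1, whereas two discounted unit-cost steps to the goal cost 1 + 0.9 = 1.9.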

In  order  to  reason  about  approximate  value  iteration,  we  will  consider  function  approximation  methods 
themselves  as  operators  on  the  space  of  value  functions:  given  any  target  value  function,  the  approximator 
will produce a fitted value function, as shown in figure 1. In the figure, the sample $X_0$ is the first five natural
numbers, and the representable functions are the cubic splines with knots in $X_0$.

The  characteristics  of  the  function  approximation  operator  determine  how  it  behaves  when  combined  with 
value  iteration.  One  particularly  important  property  is  illustrated  in  figure  2.  As  the  figure  shows,  linear 
regression can exaggerate the difference between two target value functions $V_1$ and $V_2$: a small difference
between the targets $V_1(x)$ and $V_2(x)$ can lead to a larger difference between the fitted values $\hat V_1(x)$ and
$\hat V_2(x)$. Many function approximators, such as neural nets and local weighted regression, can exaggerate this
way; others, such as k-nearest-neighbor, cannot. We will show later that this sort of exaggeration can cause
instability  in  an  approximate  value  iteration  algorithm. 
Figure 1: Function approximation methods as mappings. In (a) we see the value function for a simple
random  walk  on  the  positive  real  line.  (On  each  transition,  the  agent  has  an  equal  probability  of  moving  left 
or  right  by  one  step.  State  0  is  absorbing;  transitions  from  other  states  have  cost  1.)  Applying  a  function 
approximator  (in  this  case,  fitting  a  spline  with  knots  at  the  first  five  natural  numbers)  maps  the  value 
function  in  (a)  to  the  value  function  in  (b).  Since  the  function  approximator  discards  some  information,  its 
mapping can’t be 1-to-1: in (c) we see a different value function which the approximator also maps to (b).

Figure 2: The mapping associated with linear regression when samples are taken at the points x = 0, 1, 2. In
(a) we see a target value function (solid line) and its corresponding fitted value function (dotted line). In (b)
we see another target function and another fitted function. The first target function has values y = 0, 0, 0
at the sample points; the second has values y = 0, 1, 1. Regression exaggerates the difference between the
two functions: the largest difference between the two target functions at a sample point is 1 (at x = 1 and
x = 2), but the largest difference between the two fitted functions at a sample point is 7/6 (at x = 2).
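The numbers in this caption can be checked with an ordinary least-squares line fit. The sketch below uses exact rational arithmetic; the helper name `fit_line` is an assumption introduced here, and the fit is the standard closed-form simple regression rather than anything specific to the paper.

```python
from fractions import Fraction


def fit_line(xs, ys):
    """Least-squares line through (xs, ys); returns the fitted values at xs."""
    n = len(xs)
    mean_x = Fraction(sum(xs), n)
    mean_y = Fraction(sum(ys), n)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x for x in xs]


xs = [0, 1, 2]
target_a, target_b = [0, 0, 0], [0, 1, 1]
fitted_a, fitted_b = fit_line(xs, target_a), fit_line(xs, target_b)

target_gap = max(abs(a - b) for a, b in zip(target_a, target_b))
fitted_gap = max(abs(a - b) for a, b in zip(fitted_a, fitted_b))
```

The second fit is the line y = 1/6 + x/2, so its fitted values are 1/6, 2/3, and 7/6, and the max-norm distance between the fitted functions (7/6) exceeds the distance between the targets (1).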


Definition: Suppose we wish to approximate a function from a space S to a vector space R. Fix a sample
vector $X_0$ of points from S, and fix a function approximation scheme A. Now for each possible vector Y of
target values in R, A will produce a function f from S to R. Define $\hat Y$ to be the vector of fitted values; that
is, the i-th element of $\hat Y$ will be f applied to the i-th element of $X_0$. Now define $M_A$, the mapping associated
with A, to be the function which takes each possible Y to its corresponding $\hat Y$.

Now  we  can  apply  the  powerful  theorems  about  contraction  mappings  to  function  approximation  methods. 
In fact, it will turn out that if $M_A$ is a nonexpansion in an appropriate norm, the combination of A with
value  iteration  is  stable.  (That  is,  under  the  usual  assumptions,  value  iteration  will  converge  to  some 
approximation  of  the  value  function.)  The  rest  of  this  section  states  the  required  property  more  formally, 
then  proves  that  some  common  function  approximators  have  this  property. 

Definition: Two operators $M_F$ and T on the space S are compatible if repeated application of $M_F \circ T$ is
guaranteed to converge to some $x^* \in S$ from any initial guess $x_0 \in S$.

Note that compatibility is symmetric: if the sequence of operators $M_F, T, M_F, T, M_F, \ldots$ causes convergence
from any initial guess $x_0$, then it also causes convergence from $T(x_0)$; so the sequence $T, M_F, T, M_F, \ldots$
also causes convergence.

Theorem 3.1 Let $T_M$ be the parallel value backup operator for some Markov decision process M with state
space S, action space A, and discount $\gamma < 1$. Let X = S × A. Let F be a function approximator (for
functions from X to $\mathbb{R}$) with mapping $M_F : \mathbb{R}^{|X|} \to \mathbb{R}^{|X|}$. Suppose $M_F$ is a nonexpansion in max norm.

Then $M_F$ is compatible with $T_M$, and $M_F \circ T_M$ has contraction factor $\gamma$.

Figure  3:  A  Markov  process  and  a  CMAC  which  are  incompatible.  Part  (a)  shows  the  process.  Its  goal  is 
state  1.  On  each  step,  with  probability  95%,  the  process  follows  a  solid  arrow,  and  with  probability  5%, 
it  follows  a  dashed  arrow.  All  arc  costs  are  zero.  Part  (b)  shows  the  CMAC,  which  has  4  receptive  fields 
each  covering  3  nodes.  If  the  CMAC  starts  out  with  all  predictions  equal  to  1,  approximate  value  iteration 
produces the series of target values 1, i^)^, ... for state 2.


Proof: By the value contraction theorem, $T_M$ is a contraction in max norm with factor $\gamma$. By assumption,
$M_F$ is a nonexpansion in max norm. Therefore $M_F \circ T_M$ is a contraction in max norm by the factor $\gamma$. □

Corollary 3.1 The approximate value iteration algorithm based on F converges in max norm at the rate $\gamma$
when applied to M.

It  remains  to  show  which  function  approximators  are  compatible  with  value  iteration.  Many  common 
approximators  are  incompatible  with  standard  parallel  value  backup  operators.  For  example,  as  figure  2 
demonstrates,  linear  regression  can  be  an  expansion  in  max  norm;  and  Boyan  and  Moore  [BM95]  show  that 
the  combination  of  value  iteration  with  linear  regression  can  diverge.  Other  incompatible  methods  include 
standard  feedforward  neural  nets,  some  forms  of  spline  fitting,  and  local  weighted  regression  [BM95]. 

A  particularly  interesting  case  is  the  CMAC.  Sutton  [SS94]  has  recently  reported  success  in  combining 
value  iteration  with  a  CMAC,  and  has  suggested  that  function  approximators  similar  to  the  CMAC  are 
likely  to  allow  temporal  differencing  to  converge  in  the  online  case.  While  we  have  no  evidence  to  support 
or  refute  this  suggestion,  it  is  worth  mentioning  that  convergence  is  not  guaranteed  in  our  offline  framework, 
as  the  counterexample  in  figure  3  shows. 

In  this  example,  the  optimal  value  function  is  uniformly  zero,  since  all  arc  costs  are  zero.  The  exact 
value backup operator assigns 0 to V(1) (since state 1 is the goal) and 0.05 V(1) + 0.95 V(2) to V(i) for i ≠ 1
(since all states except 1 have a 5% chance of transitioning to state 1 and a 95% chance of transitioning to
state 2). The approximate backup operator computes these same 6 numbers as training data for the CMAC;
but since the CMAC can’t represent this function exactly, our output is the closest representable function
in the sense of least squared error. If the exact operator produces kv where v = (0,1,1,1,1,1)^T, then the
closest representable function will be kw where w = (^, 3, f , f i f j |)^T. If we repeat the process from this
new value function, the exact backup operator will now produce ^v, and the CMAC will produce -gQ-w.
Further  iteration  causes  divergence  at  the  rate 

The  CMAC  still  diverges  even  if  we  choose  a  small  learning  rate:  with  learning  rate  a,  the  rate  of 
divergence  is  l  +  However,  if  we  train  the  CMAC  based  on  actual  trajectories  in  the  Markov  process,  as 
Sutton  suggests,  we  no  longer  diverge:  transitions  out  of  state  2,  which  lower  the  coefficients  of  our  CMAC 
substantially,  are  as  frequent  as  transitions  into  state  2,  which  raise  the  coefficients  somewhat.  In  fact,  since 
this  example  is  a  Markov  process  rather  than  a  Markov  decision  process,  convergence  of  the  online  algorithm 
is  guaranteed  by  a  theorem  in  [Day92]. 

We will prove that a broad class of approximation methods is compatible with value iteration. This
class includes kernel averaging, k-nearest-neighbor, weighted k-nearest-neighbor, Bezier patches, linear
interpolation on a triangular (or tetrahedral, etc.) mesh, bilinear interpolation on a square (or cubical, etc.)
mesh, and many others. (See the Experiments section for a definition of bilinear interpolation. Note that
the square mesh is important: on non-rectangular meshes, bilinear interpolation will sometimes need to
extrapolate.)

Definition: A real-valued function approximation scheme is an averager if every fitted value is the weighted
average of zero or more target values and possibly some predetermined constants. The weights involved in
calculating the fitted value $\hat Y_i$ may depend on the sample vector $X_0$, but may not depend on the target values
Y. More precisely, for a fixed $X_0$, if Y has n elements, there must exist n real numbers $k_i$, $n^2$ nonnegative
real numbers $\beta_{ij}$, and n nonnegative real numbers $\beta_i$, so that for each i we have $\beta_i + \sum_j \beta_{ij} = 1$ and

$$\hat Y_i = \beta_i k_i + \sum_j \beta_{ij} Y_j.$$

It  should  be  obvious  that  all  of  the  methods  mentioned  before  the  definition  are  averagers:  in  all  cases, 
the  fitted  value  at  any  given  coordinate  is  a  weighted  average  of  target  values,  and  the  weights  are  determined 
by  distances  in  X  values,  and  so  are  unaffected  by  the  target  Y  values. 
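To make the definition concrete, the sketch below builds the weights $\beta_{ij}$ for an unweighted k-nearest-neighbor averager on a one-dimensional sample, with every $\beta_i$ equal to zero (no predetermined constants). The function names, the tie-breaking rule (lower index first), and the sample points are assumptions of this illustration.

```python
def knn_weights(xs, k):
    """Weights beta[i][j] for k-nearest-neighbor averaging on the sample xs.

    The weights depend only on the sample points xs, never on the target values.
    Ties in distance are broken toward the lower index (an arbitrary choice).
    """
    n = len(xs)
    beta = []
    for i in range(n):
        nearest = sorted(range(n), key=lambda j: (abs(xs[j] - xs[i]), j))[:k]
        row = [0.0] * n
        for j in nearest:
            row[j] = 1.0 / k
        beta.append(row)
    return beta


def apply_averager(beta, targets):
    """Fitted values: each one is a weighted average of the target values."""
    return [sum(b * y for b, y in zip(row, targets)) for row in beta]


xs = [0.0, 1.0, 2.0, 3.0]
beta = knn_weights(xs, k=2)
fitted = apply_averager(beta, [0.0, 2.0, 4.0, 6.0])
```

Each row of `beta` is nonnegative and sums to one, which is exactly the condition in the definition when no constants are used.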

Theorem 3.2 The mapping $M_F$ associated with any averaging method F is a nonexpansion in max norm,
and  is  therefore  compatible  with  the  parallel  value  backup  operator  for  any  discounted  MDP. 


Proof: Fix the $\beta$s and $k$s for F as in the above definition. Let Y and Z be two vectors of target values.
Consider a particular coordinate i. Then, letting $\|\cdot\|$ denote max norm,

$$\begin{aligned}
\left|[M_F(Y) - M_F(Z)]_i\right| &= \Big|\big(\beta_i k_i + \sum_j \beta_{ij} Y_j\big) - \big(\beta_i k_i + \sum_j \beta_{ij} Z_j\big)\Big| \\
&= \Big|\sum_j \beta_{ij} (Y_j - Z_j)\Big| \\
&\le \max_j |Y_j - Z_j| \\
&= \|Y - Z\|
\end{aligned}$$

That is, every element of $(M_F(Y) - M_F(Z))$ is no larger than $\|Y - Z\|$. Therefore, the max norm of
$(M_F(Y) - M_F(Z))$ is no larger than $\|Y - Z\|$. In other words, $M_F$ is a nonexpansion in max norm. □
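The nonexpansion property can also be checked numerically on randomly generated averagers. This is only a sanity check under assumptions made here (the dimension, the random seed, the ranges of the random values, and all helper names are arbitrary); it is not a substitute for the proof.

```python
import random


def max_norm(v):
    """Largest absolute entry of v."""
    return max(abs(x) for x in v)


def apply_averager(beta0, k, beta, Y):
    """Fitted value i is beta0[i] * k[i] + sum_j beta[i][j] * Y[j]."""
    return [
        beta0[i] * k[i] + sum(b * y for b, y in zip(beta[i], Y))
        for i in range(len(beta))
    ]


def random_averager(n, rng):
    """Random nonnegative weights with beta0[i] + sum_j beta[i][j] = 1 for each i."""
    beta0, beta = [], []
    for _ in range(n):
        raw = [rng.random() for _ in range(n + 1)]
        total = sum(raw)
        beta0.append(raw[0] / total)
        beta.append([r / total for r in raw[1:]])
    k = [rng.uniform(-5.0, 5.0) for _ in range(n)]
    return beta0, k, beta


rng = random.Random(0)
n = 4
beta0, k, beta = random_averager(n, rng)

worst_ratio = 0.0
for _ in range(1000):
    Y = [rng.uniform(-10.0, 10.0) for _ in range(n)]
    Z = [rng.uniform(-10.0, 10.0) for _ in range(n)]
    d_in = max_norm([y - z for y, z in zip(Y, Z)])
    fit_Y = apply_averager(beta0, k, beta, Y)
    fit_Z = apply_averager(beta0, k, beta, Z)
    d_out = max_norm([a - b for a, b in zip(fit_Y, fit_Z)])
    if d_in > 0:
        worst_ratio = max(worst_ratio, d_out / d_in)
```

The ratio of output distance to input distance never exceeds one, up to floating-point rounding, as the theorem requires.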


4  Nondiscounted  processes

Consider  now  a  nondiscounted  Markov  decision  process  M.  Suppose  for  the  moment  that  all  policies  for 
M are proper. Then the value contraction theorem states that the parallel value backup operator $T_M$ for
this process is a contraction in some weighted max norm $\|\cdot\|_W$. The previous section proved that if the
approximation method A is an averager, then $M_A$ is a nonexpansion in (unweighted) max norm. If we could
prove that $M_A$ were also a nonexpansion in $\|\cdot\|_W$, we would have $M_A$ compatible with $T_M$. Unfortunately,
$M_A$ may be an expansion in $\|\cdot\|_W$; in fact parts (a) and (b) of figure 4 show a simple example of a
nondiscounted  MDP  and  an  averager  which  are  incompatible.  (Part  (c)  is  a  simple  proof  that  the  MDP  and 
the  averager  are  incompatible;  see  below  for  an  explanation.) 

Fortunately, there are averagers which are compatible with nondiscounted MDPs. The proof relies on an intriguing property of averagers: we can view any averager as a Markov process, so that state $x$ has a transition to state $y$ whenever $\beta_{xy} > 0$, i.e., whenever the fitted $V(x)$ depends on the target $V(y)$. Part (b) of figure 4 shows one example of a simple averager viewed as a Markov process; this averager has $\beta_{11} = \beta_{23} = \beta_{33} = 1$ and all other coefficients zero.

If  we  view  an  averager  as  a  Markov  process,  and  compose  this  process  with  our  original  MDP,  we  will 
derive  a  new  MDP.  Part  (c)  of  figure  4  shows  a  simple  example;  a  slightly  more  complicated  example  is 
in  figure  5.  As  the  following  theorem  shows,  exact  value  iteration  on  this  derived  MDP  is  the  same  as 
approximate  value  iteration  on  the  original  MDP. 

Theorem 4.1 (Derived MDP) For any averager $A$ with mapping $M_A$, and for any MDP $M$ (either discounted or nondiscounted) with parallel value backup operator $T_M$, the function $T_M \circ M_A$ is the parallel value backup operator for a new Markov decision process $M'$.




Figure 4: A nondiscounted deterministic Markov process, and an averager with which it is incompatible. The process is shown in (a); the goal is state 1, and all arc costs except at the goal are 1. In (b) we see an averager, represented as a Markov process; states 1 and 3 are unchanged, while $V(2)$ is replaced by $V(3)$. The derived Markov process is shown in (c); state 3 has been disconnected, so its value estimate will diverge.




Figure  5:  An  example  of  the  construction  of  the  derived  Markov  process.  Part  (a)  shows  a  deterministic 
Markov  process;  its  state  space  is  the  unit  triangle,  and  on  every  step  the  agent  moves  a  constant  distance 
towards  the  origin.  The  value  of  each  state  is  simply  its  distance  from  the  origin,  so  the  value  function 
is  nonlinear.  For  our  function  approximator,  we  will  use  linear  interpolation  on  the  three  corners  of  the 
triangle.  Part  (b)  shows  a  representative  transition  from  the  derived  process;  as  before,  the  agent  moves 
towards  the  goal,  but  then  the  averager  moves  the  agent  randomly  to  one  of  the  three  corners.  On  average, 
this scattering moves the agent back away from the goal, so steps in the derived process don't move the
agent  as  far  on  average  as  they  did  in  the  original  process.  Part  (c)  shows  the  expected  progress  the  agent 
makes  on  each  step.  The  value  function  for  the  derived  process  is  V*(x,  y)  =  x  +  y. 
Proof: Define the derived MDP $M'$ as follows. It will have the same state and action spaces as $M$, and it will also have the same discount factor and initial distribution. We can assume without loss of generality that state 1 of $M$ is cost-free and absorbing: if not, we can renumber the states of $M$ starting at 2, add a new state 1 which satisfies this property, and make all of its incoming transition probabilities zero. We can also assume, again without loss of generality, that $\beta_1 = 1$ and $k_1 = 0$ (that is, that $A$ always sets $V(1) = 0$); again, if this property does not already hold for $M$, we can add a new state 1.

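Suppose that, in $M$, action $a$ in state $x$ takes us to state $y$ with probability $p_{axy}$. Suppose that $A$ replaces $V(y)$ by $\beta_y k_y + \sum_z \beta_{yz} V(z)$. Then we will define the transition probabilities in $M'$ for state $x$ and action $a$ to be

$$\begin{aligned}
p'_{axz} &= \sum_y p_{axy} \beta_{yz} \qquad (z \ne 1) \\
p'_{ax1} &= \sum_y p_{axy} (\beta_{y1} + \beta_y)
\end{aligned}$$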

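These transition probabilities make sense: since $A$ is an averager, we know that $\sum_z \beta_{yz} + \beta_y$ is 1, so

$$\begin{aligned}
\sum_z p'_{axz} &= \sum_{z \ne 1} \sum_y p_{axy} \beta_{yz} + \sum_y p_{axy} (\beta_{y1} + \beta_y) \\
&= \sum_y p_{axy} \Bigl(\sum_z \beta_{yz} + \beta_y\Bigr) \\
&= \sum_y p_{axy} = 1
\end{aligned}$$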

Now suppose that, in $M$, performing action $a$ from state $x$ yields expected cost $c_{xa}$. Then performing action $a$ from state $x$ in $M'$ yields expected cost

$$c'_{xa} = c_{xa} + \gamma \sum_{y'} p_{axy'} \beta_{y'} k_{y'}$$

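Now the parallel value backup operator $T_{M'}$ for $M'$ is

$$\begin{aligned}
V(x) &\leftarrow \min_a E\bigl(c'(x,a) + \gamma V(\delta'(x,a))\bigr) \\
&= \min_a \sum_z p'_{axz} \bigl(c'_{xa} + \gamma V(z)\bigr) \\
&= \min_a \Bigl[ \sum_{z \ne 1} \sum_y p_{axy} \beta_{yz} \bigl(c'_{xa} + \gamma V(z)\bigr) + \sum_y p_{axy} (\beta_{y1} + \beta_y) \bigl(c'_{xa} + \gamma V(1)\bigr) \Bigr] \\
&= \min_a \sum_y p_{axy} \Bigl[ \sum_{z \ne 1} \beta_{yz} \bigl(c'_{xa} + \gamma V(z)\bigr) + (\beta_{y1} + \beta_y) c'_{xa} \Bigr] \\
&= \min_a \sum_y p_{axy} \Bigl( c'_{xa} + \gamma \sum_{z \ne 1} \beta_{yz} V(z) \Bigr) \\
&= \min_a \Bigl( c'_{xa} + \gamma \sum_y p_{axy} \sum_{z \ne 1} \beta_{yz} V(z) \Bigr) \\
&= \min_a \Bigl( c_{xa} + \gamma \sum_{y'} p_{axy'} \beta_{y'} k_{y'} + \gamma \sum_y p_{axy} \sum_{z \ne 1} \beta_{yz} V(z) \Bigr)
\end{aligned}$$

where the fourth line uses $V(1) = 0$.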

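On the other hand, the parallel value backup operator for $M$ is

$$\begin{aligned}
V(x) &\leftarrow \min_a E\bigl(c(x,a) + \gamma V(\delta(x,a))\bigr) \\
&= \min_a \sum_y p_{axy} \bigl(c_{xa} + \gamma V(y)\bigr)
\end{aligned}$$

If we replace $V(y)$ by its approximation under $A$, the operator becomes $T_M \circ M_A$:

$$\begin{aligned}
V(x) &\leftarrow \min_a \sum_y p_{axy} \Bigl(c_{xa} + \gamma \bigl(\beta_y k_y + \sum_z \beta_{yz} V(z)\bigr)\Bigr) \\
&= \min_a \Bigl( c_{xa} + \gamma \sum_y p_{axy} \beta_y k_y + \gamma \sum_y \sum_{z \ne 1} p_{axy} \beta_{yz} V(z) \Bigr)
\end{aligned}$$

which is exactly the same as $T_{M'}$ above.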
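
The construction in this proof can be written out directly. The sketch below is one possible rendering under stated assumptions: index 0 plays the role of state 1, transition probabilities are stored as `P[a][x][y]`, costs as `C[x][a]`, and the three-state MDP and averager are made-up numbers chosen only to exercise the formulas. It builds $p'$ and $c'$ and compares one backup on $M'$ with one application of $T_M \circ M_A$ on $M$.

```python
def backup(P, C, gamma, V):
    """Parallel value backup: V(x) <- min_a [C[x][a] + gamma * sum_y P[a][x][y] * V[y]]."""
    n = len(V)
    return [
        min(
            C[x][a] + gamma * sum(P[a][x][y] * V[y] for y in range(n))
            for a in range(len(P))
        )
        for x in range(n)
    ]


def averager_map(beta, beta0, consts, V):
    """M_A: replace V(y) by beta0[y]*consts[y] + sum_z beta[y][z]*V[z]."""
    n = len(V)
    return [
        beta0[y] * consts[y] + sum(beta[y][z] * V[z] for z in range(n))
        for y in range(n)
    ]


def derived_mdp(P, C, gamma, beta, beta0, consts):
    """Build (P', C') for the derived MDP; index 0 plays the role of state 1."""
    n, num_actions = len(beta), len(P)
    P2 = [[[0.0] * n for _ in range(n)] for _ in range(num_actions)]
    C2 = [[C[x][a] for a in range(num_actions)] for x in range(n)]
    for a in range(num_actions):
        for x in range(n):
            for y in range(n):
                p = P[a][x][y]
                for z in range(1, n):
                    P2[a][x][z] += p * beta[y][z]
                # Weight placed on the constant is routed to the absorbing state.
                P2[a][x][0] += p * (beta[y][0] + beta0[y])
                C2[x][a] += gamma * p * beta0[y] * consts[y]
    return P2, C2


gamma = 0.9
P = [
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 1.0, 0.0]],  # action 0
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.2, 0.0, 0.8]],  # action 1
]
C = [[0.0, 0.0], [1.0, 2.0], [1.0, 3.0]]  # C[x][a]
beta0 = [1.0, 0.1, 0.0]   # beta_1 = 1 for the absorbing state, as in the proof
consts = [0.0, 4.0, 0.0]  # k_1 = 0
beta = [[0.0, 0.0, 0.0], [0.0, 0.6, 0.3], [0.25, 0.25, 0.5]]

P2, C2 = derived_mdp(P, C, gamma, beta, beta0, consts)
V = [0.0, 3.0, 7.0]  # any estimate with V(1) = 0
exact_on_derived = backup(P2, C2, gamma, V)
approx_on_original = backup(P, C, gamma, averager_map(beta, beta0, consts, V))
```

The two resulting vectors agree up to rounding for any `V` whose first entry is zero, which is the content of the theorem for this example.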

Given an initial estimate $V_0$ of the value function, approximate value iteration begins by computing $M_A(V_0)$, the representation of $V_0$ in $A$. Then it alternately applies $T_M$ and $M_A$ to produce the series of functions $V_0, M_A(V_0), T_M(M_A(V_0)), M_A(T_M(M_A(V_0))), \ldots$. (In an actual implementation, only the functions $M_A(\cdots)$ would be represented explicitly; the functions $T_M(\cdots)$ would just be sampled at the points $X_0$.) On the other hand, exact value iteration on $M'$ produces the series of functions $V_0, T_M \circ M_A(V_0), T_M \circ M_A(T_M \circ M_A(V_0)), \ldots$. This series obviously contains exactly the same information as the previous one. The only difference between the two algorithms is that approximate value iteration would stop at one of the functions $M_A(\cdots)$, while iteration on $M'$ would stop at one of the functions $T_M(\cdots)$.

Now  we  can  see  why  the  combination  in  figure  4  diverges:  the  derived  MDP  has  a  state  with  infinite 
cost. So, in order to prove compatibility, we need to guarantee that the derived MDP is well-behaved.
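
The divergence in figure 4 is easy to reproduce. The sketch below assumes that the process in part (a) is the chain 3 → 2 → 1 with unit arc costs; the text does not spell out the arcs, so that structure is an assumption. The averager is the one from part (b).

```python
def backup_chain(W):
    """T_M for the assumed chain 3 -> 2 -> 1 with unit arc costs and goal state 1.

    W is indexed so that W[0], W[1], W[2] hold the values of states 1, 2, 3.
    """
    return [0.0, 1.0 + W[0], 1.0 + W[1]]


def averager(V):
    """M_A from figure 4(b): V(1) and V(3) are kept, V(2) is replaced by V(3)."""
    return [V[0], V[2], V[2]]


V = [0.0, 0.0, 0.0]
history = []
for _ in range(50):
    V = backup_chain(averager(V))
    history.append(V[2])  # value estimate for state 3
```

Under these assumptions the estimate for state 3 grows by exactly one arc cost per iteration, because the averager makes state 3's backup depend on its own previous value.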

If the arc costs of a discounted MDP $M$ have finite mean and variance, it is obvious that the arc costs of $M'$ also have finite mean and variance. That means that $T_{M'} = T_M \circ M_A$ converges in max norm at the rate $\gamma$; i.e., we have just proven again that $M_A$ is compatible with $T_M$.

More importantly, if $M$ is a finite nondiscounted process, there are averagers which are compatible with it. For example, if $A$ uses weight decay (i.e., if $\beta_y > 0$ for all $y$), then $M'$ will have all policies proper, since any action in any state has a nonzero probability of bringing us immediately to state 1.

More generally, if $M$ has only proper policies, we may partition its state space by distance from state 1, as follows (see [BT89]). Let $S_1 = \{1\}$. Now recursively define $U_k = \bigcup_{j<k} S_j$ and $S_k$ as

$$S_k = \Bigl\{ x \Bigm| (x \notin U_k) \wedge \bigl(\min_{a \in A} \max_{y \in U_k} P(\delta(x,a) = y) > 0\bigr) \Bigr\}$$

Bertsekas and Tsitsiklis show that this partitioning is exhaustive; so we may set $k(x)$ to be the unique $k$ so that $x \in S_k$. If $M$ is finite, there will be finitely many nonempty $S_k$.
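
For a finite MDP the partition can be computed layer by layer. The following sketch assumes that $U_k$ is the union of the earlier sets $S_j$, $j < k$, that index 0 stands for state 1, and that probabilities are stored as `P[a][x][y]`; the example MDP is invented for illustration.

```python
def distance_partition(P):
    """Return {state: k} with state in S_k; index 0 plays the role of state 1.

    P[a][x][y] is the probability of moving from x to y under action a.
    """
    n = len(P[0])
    level = {0: 1}
    k = 1
    while len(level) < n:
        k += 1
        reached = set(level)  # U_k: states already assigned to some S_j, j < k
        new_states = [
            x for x in range(n)
            if x not in reached
            and all(any(P[a][x][y] > 0 for y in reached) for a in range(len(P)))
        ]
        if not new_states:
            raise ValueError("partition is not exhaustive: some policy is improper")
        for x in new_states:
            level[x] = k
    return level


P = [
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 1.0, 0.0]],  # action 0
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.2, 0.0, 0.8]],  # action 1
]
levels = distance_partition(P)
```

A state joins $S_k$ only when every action gives it some chance of landing in an earlier layer, which is why the loop uses `all` over actions and `any` over targets.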

We will say that an averager $A$ is self-weighted for $M$ if for every state $y$, either $\beta_y > 0$ or there exists a state $x$ so that $k(x) < k(y)$ and $\beta_{yx} > 0$.
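
The self-weighting condition is a simple check once $k(x)$ is known. In the sketch below the partition `level` is supplied by hand as a hypothetical input, index 0 stands for state 1, and both example averagers use the proof's convention $\beta_1 = 1$, $k_1 = 0$ for the goal state.

```python
def is_self_weighted(beta, beta0, level):
    """Check the self-weighting condition for every state y.

    beta[y][x] is the weight of V(x) in the fitted V(y); beta0[y] is the
    weight on the constant; level[x] is k(x).
    """
    n = len(beta)
    for y in range(n):
        if beta0[y] > 0:
            continue  # weight on the constant gives a direct route to state 1
        if not any(beta[y][x] > 0 and level[x] < level[y] for x in range(n)):
            return False
    return True


level = {0: 1, 1: 2, 2: 3}  # hypothetical k(x) for a three-state chain
beta0 = [1.0, 0.0, 0.0]
looks_toward_goal = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
looks_away = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]  # V(2) <- V(3)

toward_ok = is_self_weighted(looks_toward_goal, beta0, level)
away_ok = is_self_weighted(looks_away, beta0, level)
```

The second averager fails because state 2 draws all of its weight from a state that is farther from the goal than itself.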

Now, if $A$ is self-weighted for a finite nondiscounted MDP $M$, then $M'$ will have only proper policies: if at some time we are in state $x$, then no matter what action $a$ we take, there is a nonzero chance that we will immediately follow an arc to a state in $S_k$ for $k < k(x)$. (By definition of the partition, there is a transition in $M$ under $a$ from $x$ to some $y$ so that $k(y) < k(x)$. By the self-weighting property, either $\beta_{yz}$ must be nonzero for some $z$ so that $k(z) < k(y)$, in which case $x$ has a possible transition in $M'$ under $a$ to $z$; or $\beta_y > 0$, in which case $x$ has a possible transition in $M'$ under $a$ to state 1.) If we follow such an arc, there is then a nonzero chance that we will immediately follow another arc to a state in $S_{k'}$ for $k' < k$, and so forth until we eventually (with positive probability) reach partition $S_1$ and therefore state 1. Let $m$ be the largest integer so that $S_m$ is nonempty. Then the previous discussion shows that, no matter what state we start from, we have a positive probability of reaching state 1 in $m - 1$ steps. Call the smallest such probability $\epsilon$. Then in $k(m - 1)$ steps, we have probability at least $1 - (1 - \epsilon)^k$ of reaching state 1. The limit of this quantity as $k \to \infty$ is 1; so with probability 1 we eventually absorb from any initial state.

We  have  just  proven 
Corollary 4.1 Let $M$ be a finite nondiscounted Markov decision process. Suppose all policies in $M$ are proper. Let $A$ be an averager which is self-weighted for $M$. Let $T_M$ and $M_A$ be the parallel backup operator for $M$ and the mapping associated with $A$. Then $T_M$ and $M_A$ are compatible, and the approximate value iteration algorithm based on $A$ will converge if it is applied to $M$.

If $A$ ignores $V(x)$ for all states $x$ not in some sample $X_0$, then the states in $X_0$ will play an important role in the derived MDP $M'$. While the initial state of a trajectory may be outside of $X_0$, all transitions in $M'$ lead to states in $X_0$, so after one step the trajectory will enter $X_0$ and stay there indefinitely. This means that we can characterize a large part of $M'$ by looking only at its behavior on $X_0$, which is just what we need for a tractable algorithm. (It is worth mentioning that, if $M$ is nondiscounted, $X_0$ should contain the goal state: if it doesn't, $M'$ will have no transitions into the goal, so all of its values will be infinite.)

5  The  online  problem  and  Q-learning 

The results of the previous sections carry over directly to a gradual version of the parallel value backup operator

$$V(x) \leftarrow \min_{a \in A} E\bigl(c(x,a) + \gamma V(\delta(x,a))\bigr)$$

in which, rather than replacing $V(x)$ by its computed update on each step, we take a weighted average of the old and new values of $V(x)$. (The weights $\alpha_x$ may differ for each $x$, and may change from iteration to iteration.) That is, we can still construct the derived MDP $M'$ and perform gradual value iteration on it, and gradual value iteration on $M'$ is still the same as gradual approximate value iteration on $M$.
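
A minimal sketch of the gradual backup follows, assuming fixed per-state weights `alpha[x]` and a small discounted MDP invented for the example (index 0 is the absorbing, cost-free state). The names and numbers are illustrative only.

```python
def backup(P, C, gamma, V):
    """Parallel value backup: V(x) <- min_a [C[x][a] + gamma * sum_y P[a][x][y] * V[y]]."""
    n = len(V)
    return [
        min(
            C[x][a] + gamma * sum(P[a][x][y] * V[y] for y in range(n))
            for a in range(len(P))
        )
        for x in range(n)
    ]


def gradual_step(P, C, gamma, V, alpha):
    """Blend old and new values: V(x) <- (1 - alpha_x) V(x) + alpha_x (T V)(x)."""
    target = backup(P, C, gamma, V)
    return [(1.0 - alpha[x]) * V[x] + alpha[x] * target[x] for x in range(len(V))]


gamma = 0.9
P = [
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # action 0
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]],  # action 1
]
C = [[0.0, 0.0], [1.0, 1.0], [1.0, 3.0]]  # C[x][a]
alpha = [0.5, 0.5, 0.25]
V = [0.0, 0.0, 0.0]
for _ in range(500):
    V = gradual_step(P, C, gamma, V, alpha)
```

In this example the gradual iteration reaches the same fixed point as the full backup, only more slowly for states with smaller weights.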

The results also apply nearly directly to dynamic programming with Watkins' [Wat89] $Q^*$ function

$$Q^*(x,a) = E\bigl(c(x,a) + \gamma V^*(\delta(x,a))\bigr)$$

and Q-learning operator

$$Q(x,a) \leftarrow E\bigl(c(x,a) + \gamma \min_{a'} Q(\delta(x,a), a')\bigr)$$

(The learning rates $\alpha_{xa}$ may now be random variables, and may depend for each step not only on $x$ and $a$ but also on the entire past history of the agent's interactions with the MDP, up to but not including the current values of the random variables $c(x,a)$ and $\delta(x,a)$.) That is, we can still define a derived MDP so that the behavior of the approximate algorithm on the original MDP is the same as the behavior of the exact algorithm on the derived MDP. The derived MDP is, however, slightly different from the derived MDP for value iteration; see the Appendix for details.

The previous paragraphs imply that, if we could sample transitions at will from the derived MDP (and so compute unbiased estimates of the Q-learning updates), we could apply gradual Q-learning to these transitions and learn a policy. The convergence of this algorithm would be guaranteed, as long as the weights $\alpha_{xa}$ were chosen appropriately, by any sufficiently general convergence proof for Q-learning [JJS94, Tsi94].
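
The sketch below illustrates gradual Q-learning with transitions sampled at will. It runs on a small deterministic MDP that merely stands in for a derived MDP; the tables `next_state` and `cost`, the constant learning rate, and the uniform sampling of state-action pairs are all assumptions of the example. A constant rate is adequate here only because the toy transitions are deterministic; stochastic problems need suitably decaying rates.

```python
import random


def q_learning(next_state, cost, gamma, alpha, num_updates, rng):
    """Gradual Q-learning on transitions sampled at will from a small MDP."""
    n, num_actions = len(cost), len(cost[0])
    Q = [[0.0] * num_actions for _ in range(n)]
    for _ in range(num_updates):
        x = rng.randrange(n)
        a = rng.randrange(num_actions)
        y = next_state[x][a]  # sampled transition (deterministic in this toy MDP)
        target = cost[x][a] + gamma * min(Q[y])
        # Weighted average of the old estimate and the sampled update.
        Q[x][a] = (1.0 - alpha) * Q[x][a] + alpha * target
    return Q


next_state = [[0, 0], [0, 2], [1, 0]]  # next_state[x][a]; state 0 is absorbing
cost = [[0.0, 0.0], [1.0, 1.0], [1.0, 3.0]]
Q = q_learning(next_state, cost, gamma=0.9, alpha=0.5,
               num_updates=5000, rng=random.Random(0))
```

Taking the minimum over actions in each row of `Q` recovers the value function of this toy problem.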

Unfortunately, there is a catch. Q-learning is designed to work for online problems, where we don't know the cost or transition functions and can only sample transitions from our current state. The power of the approximate value iteration method, on the other hand, comes from the fact that we can pay attention only to transitions from a certain small set of states. So, while the derived MDP for Q-learning will still have only a few relevant states, we won't in general be able to observe many transitions from these states, and so the approximate Q-learning iteration will take a very long time to converge.

There are two ways that the approximate Q-learning algorithm might still be useful. The first is if we encounter a problem which is somewhere between online and offline: we can sample any transition at will, but don't know the cost or transition functions a priori or can't compute the necessary expectations. In such a problem we still can't use value iteration, since it is difficult to compute an unbiased estimate of the value iteration update, so the approximate Q-learning algorithm is helpful.

The second way is if we are willing to accept possible lack of convergence. Suppose our function approximator pays attention to the states in the set $X_0$. If we pretend that every transition we see from a state $x \notin X_0$ is actually a transition from the nearest state $x' \in X_0$, we will then have enough data to compute the behavior of the derived MDP on $X_0$. Unfortunately, following this approximation is equivalent to introducing hidden state into the derived MDP; so we now run the risk of divergence.
6 Converging to what?

Until  now,  we  have  only  considered  the  convergence  or  divergence  of  approximate  dynamic  programming 
algorithms.  Of  course  we  would  like  not  only  convergence,  but  convergence  to  a  reasonable  approximation 
of  the  value  function.  The  next  section  contains  some  empirical  studies  of  approximate  value  iteration; 
this  section  proves  some  error  bounds,  then  outlines  the  types  of  problems  we  have  encountered  while 
experimenting  with  approximate  value  iteration. 

6.1  Error  bounds 

Suppose that $M$ is an MDP with value function $V^*$, and let $A$ be an averager. What if $V^*$ is also a fixed point of $M_A$? Then $V^*$ is a fixed point of $T_M \circ M_A$; so, if we can show that $T_M \circ M_A$ converges, we will know that it converges to the right answer.

If $M$ is discounted and has bounded costs, $T_M \circ M_A$ will always converge; the following theorem describes some situations where nondiscounted problems converge.

Theorem 6.1 Let $V^*$ be the optimal value function for a finite nondiscounted Markov decision process $M$. Let $T_M$ be the parallel value backup operator for $M$. Let $M_A$ be the mapping for an averager $A$, and let $V^*$ also be a fixed point of $M_A$. Then $V^*$ is a fixed point of $T_M \circ M_A$; and if either

1. $M$ has all policies proper and $A$ is self-weighted for $M$, or

2. $M$ has $E(c(x, a)) > 0$ for all $x \ne 1$ and $A$ has $k_x > 0$ for all $x$

then iteration of $T_M \circ M_A$ converges to $V^*$.

Proof: By the value contraction theorem, $V^*$ is a fixed point of $T_M$. So $T_M \circ M_A(V^*) = T_M(V^*) = V^*$; that is, $V^*$ is a fixed point of $T_M \circ M_A$.

If (1) holds, then by the corollary to the derived MDP theorem, $T_M \circ M_A$ is a contraction in some weighted max norm; so $V^*$ is the unique fixed point of $T_M \circ M_A$, and iteration of $T_M \circ M_A$ converges to $V^*$.

If (2) holds, then $M'$ has $c'(x,a) > 0$ for all $x \ne 1$; so every improper policy in $M'$ has infinite cost for some initial states. If we can show that $M'$ has at least one proper policy, then $T_M \circ M_A$ must converge (and therefore must converge to $V^*$).

Note that $V^*(1) = 0$ and $V^*(x) > 0$ for $x \ne 1$, since all arc costs in $M$ are positive. Suppose we start at some non-goal state $x$ in $M'$, and choose an action $a$ so that $V^*(x) = E(c'(x, a) + V^*(\delta'(x, a)))$. (There must be such an action, since $V^*$ is a fixed point of the value backup operator for $M'$.) Since $c'(x,a) > 0$, we know that $V^*(x) > E(V^*(\delta'(x, a)))$. In particular, there must be a possible transition to some state $y$ so that $V^*(x) > V^*(y)$. If $y$ is not the goal, we can repeat the argument to find a $z$ so that $y$ has a possible transition to $z$ and $V^*(y) > V^*(z)$, and so forth until (with positive probability) we eventually reach the goal. □

The  above  theorem  is  useful  only  when  we  know  that  the  optimal  value  function  is  a  fixed  point  of  our 
averager.  For  example,  it  shows  that  bilinear  interpolation  will  converge  to  the  exact  value  function  for  a 
gridworld,  if  every  arc’s  cost  is  equal  to  its  (Manhattan)  length,  since  the  value  function  for  this  MDP  is 
linear.  If  we  are  trying  to  solve  a  discounted  MDP,  on  the  other  hand,  we  can  prove  a  much  stronger  result: 
if  we  only  know  that  the  optimal  value  function  is  somewhere  near  a  fixed  point  of  our  averager,  we  can  still 
guarantee  an  error  bound  for  approximate  value  iteration. 

Theorem 6.2 Let $V^*$ be the optimal value function for a finite Markov decision process $M$ with discount factor $\gamma$. Let $T_M$ be the parallel value backup operator for $M$. Let $M_A$ be the mapping for an averager $A$. Let $V^A$ be any fixed point of $M_A$. Suppose $\|V^A - V^*\| = \epsilon$, where $\|\cdot\|$ denotes max norm. Then iteration of $T_M \circ M_A$ converges to a value function $V_0$ so that $\|V_0 - V^*\| \le \frac{2\gamma\epsilon}{1-\gamma}$.

Proof: By the value contraction theorem, $T_M$ is a contraction in max norm with factor $\gamma$. By theorem 3.2, $M_A$ is a nonexpansion in max norm. So, $T_M \circ M_A$ is a contraction in max norm with factor $\gamma$, and therefore converges to some $V_0$. Repeated application of the triangle inequality and the definition of a contraction give

$$\begin{aligned}
\|V_0 - T_M(M_A(V^*))\| &= \|T_M(M_A(V_0)) - T_M(M_A(V^*))\| \\
&\le \gamma \|V_0 - V^*\| \\[4pt]
\|T_M(M_A(V^*)) - V^*\| &= \|T_M(M_A(V^*)) - T_M(V^*)\| \\
&\le \gamma \|M_A(V^*) - V^*\| \\
&\le \gamma \|M_A(V^*) - V^A\| + \gamma \|V^A - V^*\| \\
&= \gamma \|M_A(V^*) - M_A(V^A)\| + \gamma \|V^A - V^*\| \\
&\le \gamma \|V^* - V^A\| + \gamma \|V^A - V^*\| \\[4pt]
\|V_0 - V^*\| &\le \|V_0 - T_M(M_A(V^*))\| + \|T_M(M_A(V^*)) - V^*\| \\
&\le \gamma \|V_0 - V^*\| + 2\gamma \|V^* - V^A\| \\[4pt]
(1 - \gamma) \|V_0 - V^*\| &\le 2\gamma \|V^* - V^A\| \\[4pt]
\|V_0 - V^*\| &\le \frac{2\gamma\epsilon}{1-\gamma}
\end{aligned}$$

which is what was required.
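
The bound can be checked on a small example. The sketch below assumes a single-action discounted chain and a grid-style averager that copies each state's value from a representative of its cell; the chain, the cell layout, and all names are invented for the illustration. The fixed point $V^A$ is taken to be the aggregated version of $V^*$, which is a fixed point of this averager because aggregating twice changes nothing.

```python
def value_iteration(step, n, iterations=2000):
    """Iterate the operator `step` from the zero function."""
    V = [0.0] * n
    for _ in range(iterations):
        V = step(V)
    return V


gamma = 0.9
n = 7  # chain 6 -> 5 -> ... -> 0, unit cost per move, state 0 absorbing and free


def backup(V):
    """T_M for the single-action chain."""
    return [gamma * V[0]] + [1.0 + gamma * V[x - 1] for x in range(1, n)]


representative = [0, 1, 1, 3, 3, 5, 5]  # cells {0}, {1,2}, {3,4}, {5,6}


def aggregate(V):
    """M_A: every state takes the value of its cell's representative."""
    return [V[representative[x]] for x in range(n)]


V_star = value_iteration(backup, n)
V_A = aggregate(V_star)  # a fixed point of M_A
eps = max(abs(a - b) for a, b in zip(V_A, V_star))
V_0 = value_iteration(lambda V: backup(aggregate(V)), n)
error = max(abs(a - b) for a, b in zip(V_0, V_star))
bound = 2.0 * gamma * eps / (1.0 - gamma)
```

In this example the observed error is well inside the bound, which is a worst-case guarantee rather than a prediction of typical error.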

If we let $\gamma \to 0$, we can make the above error bound arbitrarily small. This result is somewhat counterintuitive, since $A$ may not even be able to represent $V^*$ exactly. The reason for this behavior is that the final step in computing $V_0$ is to apply $T_M$; when $\gamma = 0$, this step produces $V^*$ immediately.

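As mentioned in a previous section, approximate value iteration returns $M_A(V_0)$ rather than $V_0$ itself. So, an error bound for $M_A(V_0)$ would be useful. The error bound on $V_0$ leads directly to a bound for $M_A(V_0)$:

$$\begin{aligned}
\|V^* - M_A(V_0)\| &\le \|V^* - V^A\| + \|V^A - M_A(V_0)\| \\
&\le \epsilon + \|M_A(V^A) - M_A(V_0)\| \\
&\le \epsilon + \|V^A - V_0\| \\
&\le 2\epsilon + \frac{2\gamma\epsilon}{1-\gamma}
\end{aligned}$$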


By way of comparison, Chow and Tsitsiklis [CT89] bound the error introduced by using a grid of side $h$ to compute the value function for a type of continuous-state-space MDP. Writing $V^*_h$ for the approximate value function computed this way, their bound is

$$\|V^* - V^*_h\| \le \frac{h}{1-\gamma}\bigl(K_1 + \gamma K_2 \|V^*\|_Q\bigr)$$

(for all sufficiently small $h$) where $\|\cdot\|_Q$ is the span quasi-norm

$$\|V\|_Q = \sup_x V(x) - \inf_x V(x)$$

and $K_1$ and $K_2$ are constants. While their formalism allows them to discretize the available actions as well as the state space, an approximation which we don't consider, their bound also applies to processes with a finite number of actions. So, we can compare their bound on $V^*_h$ for the case of non-discretized actions to our bound on $M_A(V_0)$.

Write $M_h$ for the mapping associated with discretization on a grid of side $h$. (For definiteness, assume that $(M_h(V))(x) = V(x')$ where $x'$ is the center of the grid cell which contains $x$. Almost any other reasonable convention would also work.) The processes that Chow and Tsitsiklis consider satisfy

$$|V^*(x) - V^*(x')| \le L \|x - x'\|$$

for some constant $L$ (i.e., $V^*$ is Lipschitz continuous). (We can determine $L$ easily from the cost and transition functions.) So, we can find a fixed point $V_h$ of $M_h$ which is close to $V^*$, as follows. Pick a grid cell
$C$. Write $C_H$ and $C_L$ for the maximum and minimum values of $V^*$ in $C$; write $x_H$ and $x_L$ for states $x \in C$ which achieve these values. (Such states exist because $V^*$ is continuous.) We know that $\|x_H - x_L\| \le h$, since the diameter of $C$ is $h$. Therefore, by the Lipschitz condition, $|C_H - C_L| \le hL$. So, if we set $V_h(x)$ to be $\frac{1}{2}(C_H + C_L)$ for all $x \in C$, the largest difference between $V^*$ and $V_h$ on $C$ will be at most $\frac{hL}{2}$. Applying our error bound with $\epsilon = \frac{hL}{2}$ now gives

$$\|V^* - M_h(V_0)\| \le \frac{hL}{1-\gamma}$$

which has slightly different constants but the same behavior as $h \to 0$ or $\gamma \to 1$ as the bound of Chow and Tsitsiklis. Our bound is valid for any $h$, rather than just sufficiently small $h$. Their bound depends on $\|V^*\|_Q$; since $\|V^*\|_Q$ is bounded by $\frac{r}{1-\gamma}$ where $r$ is the largest single-step expected reward, we have folded a similar dependence into the constant $L$.

The  sort  of  error  bound  which  we  have  proved  is  particularly  useful  for  function  approximators  such  as 
linear  interpolation  which  have  many  fixed  points.  (In  fact,  every  function  which  is  representable  by  a  linear 
interpolator  is  a  fixed  point  of  that  interpolator’s  mapping.  The  same  is  true  for  bilinear  interpolation,  grids, 
and 1-nearest-neighbor; it is not true for local weighted averaging or $k$-nearest-neighbor for $k > 1$.) Because it depends on the maximum difference between $V^*$ and the fixed point of $M_A$, the error bound is not very useful if $V^*$ may have large discontinuities at unknown locations: if $V^*$ has a discontinuity of height $d$, then any averager which can't mimic the location of this discontinuity exactly will have no representable functions (and therefore no fixed points) within $\frac{d}{2}$ of $V^*$.

6.2  In  practice 

The  most  common  problem  with  approximate  value  iteration  is  the  presence  of  barriers  in  the  derived  MDP. 
That  is,  sometimes  the  derived  MDP  can  be  divided  into  two  pieces  so  that  the  first  piece  contains  the  goal 
and  the  second  piece  has  no  transitions  into  the  first.  In  this  case,  the  estimated  values  of  the  states  in  the 
second  piece  will  be  infinite.  (We  mentioned  a  special  case  of  this  situation  above:  if  the  averager  ignores 
the  goal  state,  then  the  derived  MDP  will  have  no  transitions  into  the  goal.)  A  less  drastic  but  similar 
problem  occurs  when  the  second  piece  has  only  low-probability  transitions  to  the  first;  in  this  case,  the  costs 
for  states  in  the  second  piece  will  not  be  infinite,  but  will  still  be  artificially  inflated. 

This  sort  of  problem  is  likely  to  happen  when  the  MDP  has  short  transitions  and  when  there  are  large 
regions  where  a  single  state  dominates  the  averager.  For  a  particularly  bad  example,  suppose  our  function 
approximator is 1-nearest-neighbor. If the transitions out of a sampled state $x$ in $M$ are shorter than half the distance to the nearest adjacent sampled state, then the only transitions out of $x$ in $M'$ will lead straight back to $x$. Similarly, in local weighted averaging with a narrow kernel, a short transition out of $x$ in $M$ will
translate  to  a  high  probability  self  loop  in  M'.  In  both  cases,  the  effect  of  the  averager  is  to  produce  a  drag 
on  transitions  out  of  x:  actions  in  M'  don’t  get  the  agent  as  far  on  average  as  they  did  in  M. 
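
The 1-nearest-neighbor barrier can be seen in a one-dimensional toy problem. The sketch assumes a state space $[0, 1]$, three sample points, and a process that moves a fixed distance toward 0 on each step; all of these choices, and the function names, are illustrative.

```python
def nearest_sample(samples, x):
    """1-nearest-neighbor: the sample point closest to x."""
    return min(samples, key=lambda s: abs(s - x))


def derived_successor(samples, x, step):
    """Successor of sample x in M': move `step` toward 0 in M, then snap to a sample."""
    return nearest_sample(samples, max(0.0, x - step))


samples = [0.0, 0.5, 1.0]
# A step shorter than half the sample spacing (0.25) snaps back to the start.
stuck = derived_successor(samples, 0.5, step=0.1)
# A longer step escapes to the neighboring sample.
moves = derived_successor(samples, 0.5, step=0.3)
```

With the short step the derived process has only a self loop at 0.5, so the goal at 0 is unreachable from there; lengthening the step removes the barrier.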

One  way  to  reduce  this  drag  is  to  make  sure  that  no  single  state  has  the  dominant  weight  over  a  large 
region.  The  best  way  to  do  so  is  to  sample  the  state  space  more  densely;  but  if  we  could  afford  to  do 
that,  we  wouldn’t  need  a  function  approximator  in  the  first  place.  Another  way  is  to  increase  a  smoothing 
parameter  such  as  kernel  width  or  number  of  neighbors,  and  so  reduce  the  weight  of  each  sample  point  in 
its  immediate  neighborhood.  Unfortunately,  increased  smoothing  brings  its  own  problems:  it  can  remove 
exactly  the  features  of  the  value  function  that  we  are  interested  in.  For  example,  if  the  agent  must  follow  a 
long,  narrow  path  to  the  goal,  the  scattering  effect  of  a  wide-kernel  averager  is  almost  certain  to  push  it  off 
of  the  path  long  before  it  reaches  the  end.  We  will  see  an  example  of  this  problem  in  the  hill-car  experiment 
below. 

Both  of  the  above  problems  —  too  much  smoothing  and  the  introduction  of  barriers  —  can  be  reduced 
if  we  can  alter  our  MDP  so  that  the  actions  move  the  agent  farther.  For  example,  we  might  look  ahead  two 
or more time steps at each value backup. (This strategy corresponds to the dynamic programming operator $T_M^n \circ M_A$ for some $n > 1$. Since $T_M^n$ is the backup operator for an MDP (derived by composing $n$ copies of $M$), the previous sections' convergence theorems also apply to $T_M^n \circ M_A$.) While in general the cost of looking ahead $n$ steps is exponential in $n$, there are many circumstances where we can reduce this cost dramatically. For instance, in a physical simulation, we can choose a longer time increment; in a grid world, we can consider only the compound actions which don't contain two steps in opposite directions; and in the case of a Markov process, where there's only 1 action, the cost of lookahead is linear rather than exponential in $n$. (In the last case, TD($\lambda$) [Sut88] allows us to combine lookaheads at several depths.) If actions are selected from an interval of $\mathbb{R}$, numerical minimum-finding algorithms such as Newton's method or golden
section  search  can  find  a  local  minimum  quickly.  In  any  case,  if  the  depth  and  branching  factor  are  large 
enough,  standard  heuristic  search  techniques  can  at  least  chip  away  at  the  base  of  the  exponential. 


7  Experiments 

This  section  describes  our  experiments  with  the  Markov  decision  problems  from  [BM95]. 

7.1  Puddle  world 

In  this  world,  the  state  space  is  the  unit  square,  and  the  goal  is  the  upper  right  corner.  The  agent  has  four 
actions,  which  move  it  up,  left,  right,  or  down  by  .1  per  step.  The  cost  of  each  action  depends  on  the  current 
state:  for  most  states,  it  is  the  distance  moved,  but  for  states  within  the  two  “puddles,”  the  cost  is  higher. 
See  figure  6. 

For a function approximator, we will use bilinear interpolation, defined as follows: to find the predicted value at a point $(x, y)$, first find the corners $(x_0, y_0)$, $(x_0, y_1)$, $(x_1, y_0)$, and $(x_1, y_1)$ of the grid square containing $(x, y)$. Interpolate along the left edge of the square between $(x_0, y_0)$ and $(x_0, y_1)$ to find the predicted value at $(x_0, y)$. Similarly, interpolate along the right edge to find the predicted value at $(x_1, y)$. Now interpolate across the square between $(x_0, y)$ and $(x_1, y)$ to find the predicted value at $(x, y)$.
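
The three interpolation steps translate directly into code. The sketch below handles a single grid square; the function name, the dictionary of corner values, and the linear test function are assumptions of the example.

```python
def bilinear(corner_values, x0, x1, y0, y1, x, y):
    """Predict the value at (x, y) from the four corners of its grid square.

    corner_values maps each corner (xi, yj) to the stored value there.
    """
    # Interpolate along the left and right edges to height y.
    ty = (y - y0) / (y1 - y0)
    left = (1.0 - ty) * corner_values[(x0, y0)] + ty * corner_values[(x0, y1)]
    right = (1.0 - ty) * corner_values[(x1, y0)] + ty * corner_values[(x1, y1)]
    # Interpolate across the square between the two edge values.
    tx = (x - x0) / (x1 - x0)
    return (1.0 - tx) * left + tx * right


def linear(x, y):
    return 2.0 * x + 3.0 * y + 1.0


corners = {(cx, cy): linear(cx, cy) for cx in (0.0, 1.0) for cy in (0.0, 1.0)}
predicted = bilinear(corners, 0.0, 1.0, 0.0, 1.0, 0.25, 0.75)
```

Because the four implied weights are nonnegative, sum to one, and depend only on the position of $(x, y)$ in the square, this approximator is an averager; it also reproduces any linear function exactly, as the test function shows.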

Figure  6  shows  the  cost  function  for  one  of  the  actions,  the  optimal  value  function  computed  on  a  100  x  100 
grid,  an  estimate  of  the  optimal  value  function  computed  with  bilinear  interpolation  on  the  corners  of  a  7  x  7 
grid  (i.e.,  on  64  sample  points),  and  the  difference  between  the  two  estimates.  Since  the  optimal  value 
function  is  nearly  piecewise  linear  outside  the  puddles,  but  curved  inside,  the  interpolation  performs  much 
better  outside  the  puddles:  the  root  mean  squared  difference  between  the  two  approximations  is  2.27  within 
one step of the puddles, and .057 elsewhere. (The lowest-resolution grid which beats bilinear interpolation's
performance  away  from  the  puddles  is  20  x  20;  but  even  a  5  x  5  grid  can  beat  its  performance  near  the 
puddles.) 

7.2  Car  on  a  hill 

In  this  world,  the  agent  must  drive  a  car  up  to  the  top  of  a  steep  hill.  Unfortunately,  the  car’s  motor  is  weak: 
it  can’t  climb  the  hill  from  a  standing  start.  So,  the  agent  must  back  the  car  up  and  get  a  running  start.  The 
state space is $[-1, 1] \times [-2, 2]$, which represents the position and velocity of the car; there are two actions,
forward  and  reverse.  (This  formulation  differs  slightly  from  [BM95]:  they  allowed  a  third  action,  coast.  We 
expect  that  the  difference  makes  the  problem  no  more  or  less  difficult.)  The  cost  function  measures  time 
until  goal. 

There  are  several  interesting  features  to  this  world.  First,  the  value  function  contains  a  discontinuity 
despite  the  continuous  cost  and  transition  functions:  there  is  a  sharp  transition  between  states  where  the 
agent  has  just  enough  speed  to  get  up  the  hill  and  those  where  it  must  back  up  and  try  again.  Since 
most  function  approximators  have  trouble  representing  discontinuities,  it  will  be  instructive  to  examine  the 
performance  of  approximate  value  iteration  in  this  situation.  Second,  there  is  a  long,  narrow  region  of 
state  space  near  the  goal  through  which  all  optimal  trajectories  must  pass  (it  is  the  region  where  the  car 
is  partway  up  the  hill  and  moving  quickly  forward).  So,  excessive  smoothing  will  cause  errors  over  large 
regions  of  the  state  space.  Finally,  the  physical  simulation  uses  a  fairly  small  time  step,  .03  seconds,  so  we 
need  fine  resolution  in  our  function  approximator  just  to  make  sure  that  we  don’t  introduce  a  barrier. 

The  results  of  our  experiments  appear  in  figure  7.  For  a  reference  model,  we  fit  a  128  x  128  grid.  While 
this  model  has  16384  parameters,  it  is  still  less  than  perfect:  the  right  end  of  the  discontinuity  is  somewhat 
rough.  (Boyan  and  Moore  used  a  200  by  200  grid  to  compute  their  optimal  value  function,  and  it  shows 
no  perceptible  roughness  at  this  boundary.)  We  also  fit  two  smaller  grids,  one  64  x  64  and  one  32  x  32. 
Finally,  we  fit  a  weighted  4-nearest  neighbor  model  using  the  1024  centers  of  the  cells  of  the  32  x  32  grid  as 
sample points, and another using a uniform random sample of 1000 points from the state space. Note that
the nearest-neighbor methods are roughly comparable in complexity to the 32 x 32 grid: each one requires
us to evaluate about two thousand transitions in the MDP for every value backup.

Figure 6: The puddle world. From top left: the cost of moving up, the optimal value function as seen by a
100 x 100 grid, the optimal value function as seen by bilinear interpolation on the corners of a 7 x 7 grid,
and the difference between the two value functions.

Figure 7: Approximations to the value function for the hill-car problem. From the top: the reference model,
a 32 x 32 grid, a k-nearest-neighbor model, the error of the 32 x 32 grid, and the error of the k-nearest-neighbor
model. In each plot, the horizontal axes represent the agent's position and velocity, and the vertical
axis represents the estimated time remaining until reaching the summit at x = .6.

Figure 8: Two smaller models for the hill-car world: a divergent 12 x 12 grid, and a convergent nearest-
neighbor model on the same 144 sample points.

As the difference plots show, most of the error in the smaller models is concentrated around the discontinuity
in the value function. Near the discontinuity, the grids perform better than the nearest-neighbor
models  (as  we  would  expect,  since  the  nearest-neighbor  models  tend  to  smooth  out  discontinuities).  But 
away  from  the  discontinuity,  the  nearest-neighbor  models  win.  The  32  x  32  nearest-neighbor  model  also 
beats  the  32  x  32  grid  at  the  right  end  of  the  discontinuity:  the  car  is  moving  slowly  enough  here  that  the 
grid  thinks  that  one  of  the  actions  keeps  the  car  in  exactly  the  same  place.  The  nearest-neighbor  model, 
on  the  other  hand,  since  it  smooths  more,  doesn’t  introduce  as  much  drag  as  the  grid  does  and  so  doesn’t 
have  this  problem.  The  root  mean  square  error  of  the  64  x  64  grid  (not  shown)  from  the  reference  model  is 
0.190s,  and  of  the  32  x  32  grid  is  0.336s.  The  RMS  error  of  the  4-nearest-neighbor  fitter  with  samples  at 
the  grid  points  is  0.205s.  The  nearest-neighbor  fitter  with  a  random  sample  (not  shown)  performs  slightly 
worse,  but  still  significantly  better  than  the  32  x  32  grid  (one-tailed  t-test  gives  p  =  .971):  its  error,  averaged 
over  5  runs,  is  0.235s. 
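
To make the smoothing behavior of the nearest-neighbor models concrete, here is a minimal sketch of a weighted k-nearest-neighbor predictor. The text says only that the model is a weighted 4-nearest-neighbor fitter; the inverse-distance weighting, the function names `knn_weights` and `knn_predict`, and the sample data are assumptions made for illustration.

```python
import math


def distance(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_weights(samples, query, k=4, eps=1e-12):
    """Return {sample_index: weight} over the k samples nearest to `query`.

    Assumption: weights are proportional to 1/distance. A query that
    coincides with a sample takes all of its weight from that sample.
    """
    nearest = sorted((distance(p, query), i) for i, p in enumerate(samples))[:k]
    if nearest[0][0] < eps:
        return {nearest[0][1]: 1.0}
    raw = {i: 1.0 / d for d, i in nearest}
    total = sum(raw.values())
    return {i: w / total for i, w in raw.items()}


def knn_predict(samples, targets, query, k=4):
    """Weighted average of the targets at the k nearest samples."""
    return sum(w * targets[i] for i, w in knn_weights(samples, query, k).items())


# Hypothetical samples straddling a jump in the target values from 0 to 10.
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 2.0)]
targets = [0.0, 0.0, 10.0, 10.0, 10.0]
print(knn_predict(samples, targets, (0.5, 0.5)))  # approximately 5.0
```

Because the weights are non-negative and sum to one, every prediction lies between the smallest and largest neighboring targets. A query between the two sides of a jump therefore receives an intermediate value, which is the smoothing of discontinuities noted above.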

All  of  the  above  models  are  fairly  large:  the  smallest  one  requires  us  to  evaluate  2000  transitions  for 
every  value  backup.  Figure  8  shows  what  happens  when  we  try  to  fit  a  smaller  model.  The  12  x  12  grid  is 
shown  after  60  iterations;  it  is  in  the  process  of  diverging,  since  the  transitions  are  too  short  to  reach  the 
goal  from  adjacent  grid  cells.  The  4-nearest-neighbor  fitter  on  the  same  144  grid  points  has  converged;  its 
RMS  error  from  the  reference  model  is  0.278s  (better  than  the  32  x  32  grid,  despite  needing  to  simulate 
fewer  than  one-seventh  as  many  transitions).  A  4-nearest-neighbor  fitter  on  a  random  sample  of  size  150 
(not  shown)  also  converged,  with  RMS  error  0.423s. 
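
The transition counts quoted above, and a rough feel for why a coarse grid gets stuck, can be checked with some simple arithmetic. The sketch below is a back-of-the-envelope calculation rather than a reproduction of the simulator: it assumes that backups start from cell centers, that one step moves the car by roughly speed times the time step, and it looks at the position axis only (the velocity axis is ignored). The function names are hypothetical.

```python
ACTIONS = 2            # forward and reverse
DT = 0.03              # simulation time step, in seconds
POSITION_WIDTH = 2.0   # position ranges over [-1, 1]
MAX_SPEED = 2.0        # velocity ranges over [-2, 2]


def transitions_per_sweep(sample_points, actions=ACTIONS):
    """Transitions simulated to back up every sample point once."""
    return sample_points * actions


def speed_to_leave_cell(cells_per_axis, dt=DT, width=POSITION_WIDTH):
    """Speed needed to travel from a cell's center past its edge in one step.

    Assumption: displacement is approximated by speed * dt, position axis only.
    """
    half_cell = width / cells_per_axis / 2
    return half_cell / dt


print(transitions_per_sweep(32 * 32))  # 2048
print(transitions_per_sweep(12 * 12))  # 288
print(speed_to_leave_cell(32))         # about 1.04
print(speed_to_leave_cell(12))         # about 2.78, above MAX_SPEED
```

Under these assumptions, 288 transitions is just under one-seventh of 2048. On the 12 x 12 grid, no admissible speed carries the car out of a cell along the position axis in a single step; on the 32 x 32 grid, only the slower states are affected.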

8  Conclusions  and  further  research 

We have proved convergence for a wide class of approximate temporal difference methods, and shown experimentally
that these methods can solve Markov decision processes more efficiently than grids of comparable
accuracy. 

Unfortunately,  many  popular  function  approximators,  such  as  neural  nets,  linear  regression,  and  CMACs, 
do  not  fall  into  this  class  (and  in  fact  can  diverge).  The  chief  reason  for  divergence  is  exaggeration:  the  more 
a  method  can  exaggerate  small  changes  in  its  target  function,  the  more  often  it  diverges  under  temporal 
differencing.  (In  some  cases,  though,  it  is  possible  to  detect  and  compensate  for  this  instability.  The 
grow-support  algorithm  of  [BM95],  which  detects  instability  by  interspersing  TD(1)  “rollouts,”  is  a  good 
example.) 
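
One way to see what exaggeration means is to measure how much a fitter can amplify a change in its targets, in max norm. For a fitter whose predictions are a linear function of the targets, this amplification is the largest absolute row sum of the matrix mapping targets to predictions. The sketch below compares a small hypothetical averager with a least-squares line fitted at three points; both examples and the function names are ours, chosen for illustration. A gain above 1 shows that exaggeration is possible; it does not by itself show that a particular run diverges.

```python
from fractions import Fraction


def max_norm_gain(matrix):
    """Largest factor by which a linear fitter can amplify a max-norm change in targets."""
    return max(sum(abs(w) for w in row) for row in matrix)


def regression_hat_matrix(xs):
    """Matrix mapping targets at xs to least-squares line predictions at xs."""
    n = len(xs)
    mean = Fraction(sum(xs), n)
    sxx = sum((x - mean) ** 2 for x in xs)
    return [[Fraction(1, n) + (xi - mean) * (xj - mean) / sxx for xj in xs]
            for xi in xs]


# Hypothetical averager: each prediction is the mean of a point and its neighbors.
averager = [
    [Fraction(1, 2), Fraction(1, 2), 0],
    [Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)],
    [0, Fraction(1, 2), Fraction(1, 2)],
]
hat = regression_hat_matrix([0, 1, 2])

print(max_norm_gain(averager))  # 1
print(max_norm_gain(hat))       # 4/3
```

In the regression case, the row for the point at 0 is (5/6, 1/3, -1/6): moving the three targets by +1, +1, and -1 moves that prediction by 4/3, more than any single target moved.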

There  is  another  important  difference  between  averagers  and  methods  like  neural  nets.  This  difference 
is the ability to allocate structure dynamically: an averager cannot decide to concentrate its resources on
one important region of the state space, whether or not this decision is justified. This ability is important,
and it can be grafted on to averagers (for example, adaptive sampling for k-nearest-neighbor, or adaptive
meshes  for  grids  or  interpolation).  The  resulting  function  approximator  still  does  not  exaggerate;  but  it  is 
no  longer  an  averager,  and  so  is  not  covered  by  this  paper’s  proofs.  Still,  methods  of  this  sort  have  been 
shown  to  converge  in  practice  [Moo94,  Moo91],  so  there  is  hope  that  a  proof  is  possible. 


A  The  derived  process  for  Q-learning 

Here is the analog for Q-learning of the derived MDP theorem. The chief difference is that, where the
theorem for value iteration considered the combined operator $T_M \circ M_A$, this version considers $M_A \circ T_M$.
The difference is necessary to keep the min operation in the Q-learning backup from getting in the way. Of
course, if we show that either $T_M \circ M_A$ or $M_A \circ T_M$ converges from any initial guess, then the other must
also converge.

Theorem A.1 (Derived MDP for Q-learning) For any averager $A$ with mapping $M_A$, and for any
MDP $M$ (either discounted or nondiscounted) with Q-learning backup operator $T_M$, the function $M_A \circ T_M$
is the Q-learning backup operator for a new Markov decision process $M'$.

Proof: The domain of $A$ will now be pairs of states and actions. Write $\beta_{xayb}$ for the coefficient of $Q(y,b)$
in the approximation of $Q(x,a)$; write $k_{xa}$ and $\beta_{xa}$ for the constant and its coefficient.

Take an initial guess $Q(x,a)$. Write $Q'$ for the result of applying $T_M$ to $Q$; write $Q''$ for the result of
applying $M_A$ to $Q'$. Then we have

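\begin{align*}
Q'(x,a) &= E\bigl(c(x,a) + \gamma \min_b Q(\delta(x,a), b)\bigr) \\
        &= c_{xa} + \gamma \sum_y p_{xay} \min_b Q(y,b) \\[1ex]
Q''(z,c) &= \sum_x \sum_a \beta_{zcxa} Q'(x,a) + \beta_{zc} k_{zc} \\
         &= \sum_x \sum_a \beta_{zcxa} \Bigl( c_{xa} + \gamma \sum_y p_{xay} \min_b Q(y,b) \Bigr) + \beta_{zc} k_{zc} \\
         &= \sum_x \sum_a \beta_{zcxa} c_{xa} + \gamma \sum_x \sum_a \beta_{zcxa} \sum_y p_{xay} \min_b Q(y,b) + \beta_{zc} k_{zc} \\
         &= \Bigl( \sum_x \sum_a \beta_{zcxa} c_{xa} + \beta_{zc} k_{zc} \Bigr) + \gamma \sum_y \Bigl( \sum_x \sum_a \beta_{zcxa} p_{xay} \Bigr) \min_b Q(y,b)
\end{align*}

We now interpret the first parenthesis above as the cost of taking action $c$ from state $z$ in $M'$; the second
parenthesis is the transition probability $p'_{zcy}$ for $M'$. Note that the sum $\sum_y p'_{zcy}$ will generally be less than
1; so we will make up the difference by adding a transition in $M'$ from state $z$ with action $c$ to state 1 (which
is assumed as before to be cost-free and absorbing and to have $V(1) = 0$). $\Box$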
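
The identity in this proof can be checked numerically on a small example. The sketch below builds a random discounted MDP and a random averager, then compares the averaged Q-learning backup with a direct Q-learning backup in the derived process. Everything concrete here is an assumption for illustration: the sizes, the random data, the discount 0.9, and the names `beta`, `beta0`, `k`, `derived_cost`, and `derived_trans`.

```python
import random

random.seed(0)
N_STATES, N_ACTIONS, GAMMA = 3, 2, 0.9
S, A = range(N_STATES), range(N_ACTIONS)


def normalized(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


# A small random MDP M: costs cost[x][a] and transition probabilities p[x][a][y].
cost = [[random.random() for _ in A] for _ in S]
p = [[normalized([random.random() for _ in S]) for _ in A] for _ in S]

# A random averager: for each (z, c), non-negative weights beta over all (x, a)
# plus a weight beta0 on a constant k, all summing to one.
beta, beta0, k = {}, {}, {}
for z in S:
    for c in A:
        w = normalized([random.random() for _ in range(N_STATES * N_ACTIONS + 1)])
        beta0[z, c], k[z, c] = w[-1], random.random()
        beta[z, c] = {(x, a): w[x * N_ACTIONS + a] for x in S for a in A}


def q_backup(q, cost_fn, trans_fn):
    """Q-learning backup: cost plus discounted expected min over next actions."""
    return {
        (x, a): cost_fn(x, a)
        + GAMMA * sum(trans_fn(x, a, y) * min(q[y, b] for b in A) for y in S)
        for x in S for a in A
    }


def averager(q):
    """Apply the averager: a weighted average of q values plus a constant term."""
    return {
        (z, c): sum(beta[z, c][xa] * q[xa] for xa in q) + beta0[z, c] * k[z, c]
        for z in S for c in A
    }


def derived_cost(z, c):
    """First parenthesis of the final equation: the cost in the derived process."""
    return sum(beta[z, c][x, a] * cost[x][a] for x in S for a in A) + beta0[z, c] * k[z, c]


def derived_trans(z, c, y):
    """Second parenthesis: the transition probability in the derived process."""
    return sum(beta[z, c][x, a] * p[x][a][y] for x in S for a in A)


q = {(x, a): random.random() for x in S for a in A}
lhs = averager(q_backup(q, lambda x, a: cost[x][a], lambda x, a, y: p[x][a][y]))
rhs = q_backup(q, derived_cost, derived_trans)
max_gap = max(abs(lhs[key] - rhs[key]) for key in lhs)
print(max_gap)  # expected to be at the level of floating-point rounding
```

In this sketch the derived transition probabilities out of each pair sum to one minus `beta0`, which is the shortfall that the proof assigns to the absorbing state.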